Naval Ocean Systems Center has investigated the potential of the
systolic architecture for signal processing applications since the
concept was introduced by H.T. Kung in 1978. This highly parallel
architecture of nearest neighbor data communication and repeated
processing node structure promises a favorable marriage of VLSI wafer
scale integration and matrix based signal processing algorithms. The
successful merging of the technology with the mathematical concepts of
eigenvector decomposition, singular value decomposition, or orthogonal
factorization necessitates a careful study of a large number of
architectural issues. Functional factors associated with the design of
a systolic processing element must be addressed. For instance, should
bit-serial or bit-parallel computation be utilized? Do the dynamic
range of the candidate applications or the numerical stability of the
algorithms used require computations in fixed point and integer format
or in the architecturally more complex and slower floating point
format? The relationship between the input/output data flow rate and
management and the internal computational speed must be studied in
assessing the complexity of the processing element.


Design factors bearing on the type of systolic architecture used are
also derived from the need to fully utilize each processing element in
the array. The number of elements, the interconnection scheme, and the
amount of local or global program intelligence must be established
with consideration of the algorithm(s) to be mapped onto the
architecture.


Initial work performed at NOSC resulted in an 8 X 8 systolic array
testbed and development software to aid in the subsequent algorithm
mapping efforts. The array was built from off-the-shelf microprocessor
based processor componentry. The architecture was very flexible and
programmable and served as a good platform to address the design
issues described earlier. The limited data input/output throughput
structure and moderate clock speed of this testbed limited its
usefulness for real-time signal processing applications.


NOSC is presently investigating the application of systolic
architectures to adaptive beamforming. Using the background derived
from the original testbed investigations and the performance
requirements associated with a chosen beamforming algorithm, a second
generation systolic array processor has been designed and built. A
description of the algorithm, a special version of the modified
Gram-Schmidt orthogonalizer, and its intended real-time adaptive
beamforming application is beyond the scope of this paper. The
architectural requirements imposed on the systolic array by the
anticipated algorithm to be mapped, however, will be described here.


The systolic array testbed system, figure 1, is composed of 16
Arithmetic Processing Modules (APM), 4 Input/Output Modules (IOM), a
System Control Module (SCM), and the system host control computer (an
IBM-AT for this initial configuration). The 16 APM's are systolically
connected via orthogonal 40-bit bi-directional parallel data buses to
adjacent APM's or to an IOM on each of the boundaries of the 4 X 4
square array. Data communication to hardware external to the array
occurs via the 4 external ports (top, bottom, right, and left). An
additional 40-bit bus, called the data circus, is included in the
processor architecture, allowing data movement around the periphery of
the systolic array structure. Communication between the processing
elements of the systolic array and the system control module is
handled by a global bus called the Array Control Bus (ACB). The host
computer communicates control, diagnostics and data to the APM's and
IOM's via the SCM.

Figure 1 Systolic Testbed System
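
The connectivity described above can be sketched in a few lines of
Python. This is only an illustrative model: the function name, the
(row, column) numbering, and the labels such as "IOM-top" are
assumptions made for the example and do not come from the testbed
software. The sketch also assumes that every link leaving the array on
a given side terminates on the IOM serving that side.

```python
# Hypothetical model of the 4 x 4 testbed topology (names are illustrative).
N = 4  # the array is 4 x 4 = 16 APMs


def neighbors(row, col, n=N):
    """Return what sits on each of the four orthogonal ports of APM (row, col).

    Interior links go to the adjacent APM. Links that would leave the
    array are assumed to end on the IOM that serves that boundary.
    """
    ports = {
        "top": (row - 1, col),
        "bottom": (row + 1, col),
        "left": (row, col - 1),
        "right": (row, col + 1),
    }
    result = {}
    for side, (r, c) in ports.items():
        if 0 <= r < n and 0 <= c < n:
            result[side] = f"APM({r},{c})"
        else:
            result[side] = f"IOM-{side}"
    return result


corner = neighbors(0, 0)      # two ports face IOMs, two face APMs
interior = neighbors(1, 2)    # all four ports face APMs

all_ports = [v for r in range(N) for c in range(N)
             for v in neighbors(r, c).values()]
# Each APM-to-APM bus is seen from both ends, so halve the count.
apm_links = sum(v.startswith("APM") for v in all_ports) // 2
iom_links = sum(v.startswith("IOM") for v in all_ports)
print(corner)
print(interior)
print(apm_links, iom_links)
```

Under these assumptions the model contains 24 APM-to-APM buses and 16
boundary connections, four per IOM.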


Arithmetic Processor Module
Functional Features

Each APM is composed of 330 off-the-shelf integrated circuits and has
been constructed on a 16" X 16" wire wrap circuit panel. Figure 2
identifies the major functional components of each APM and hints at
the highly parallel architecture contained within each module.


Figure 2 APM Functional Diagram


The computational power of the APM resides in the ALU, which is
composed of a pair of 8 MHz Weitek floating point processor chips, the
1033 multiplier and the 1032 ALU. The module architecture allows both
of these chips to perform simultaneous computations on separate sets
of operands while communication to neighboring processing modules may
also occur. This degree of parallelism is achieved by the use of 4
Weitek register file chips (1066), each of which is a five ported, 32
location, 32-bit wide scratch pad memory. Because a total of 128
(4 X 32) locations for operand storage was deemed inadequate to
support most signal processing algorithms, an additional 4K locations
of single ported memory is included in this scratch pad function.
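
The figures in this paragraph lend themselves to a short
back-of-the-envelope calculation. The sketch below takes the clock as
8 MHz, which matches the 125 nanosecond interval quoted for the module,
and reads "4K" as 4096 locations. The peak rates further assume that
each floating point chip delivers one result per clock, which is an
upper bound chosen for illustration and not a measured figure.

```python
# Back-of-the-envelope numbers for one APM and for the 16-module array.
CLOCK_HZ = 8_000_000       # 8 MHz, i.e. a 125 ns cycle
FP_CHIPS_PER_APM = 2       # one multiplier chip plus one ALU chip
APM_COUNT = 16

REGISTER_FILES = 4
LOCATIONS_PER_FILE = 32
EXTRA_LOCATIONS = 4 * 1024  # "4K" read as 4096 (assumption)

cycle_ns = 1_000_000_000 // CLOCK_HZ

# Upper bound only: assumes one result per chip per clock (assumption).
peak_ops_per_apm = CLOCK_HZ * FP_CHIPS_PER_APM
peak_ops_array = peak_ops_per_apm * APM_COUNT

multiport_locations = REGISTER_FILES * LOCATIONS_PER_FILE
total_locations = multiport_locations + EXTRA_LOCATIONS

print(f"cycle time: {cycle_ns} ns")
print(f"peak per APM: {peak_ops_per_apm / 1e6:.0f} M operations/s")
print(f"peak for array: {peak_ops_array / 1e6:.0f} M operations/s")
print(f"scratch pad: {multiport_locations} multi-ported + "
      f"{EXTRA_LOCATIONS} single-ported = {total_locations} locations")
```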

The I/O structure of the APM has been made highly parallel and
reconfigurable to allow the greatest latitude for algorithm data
movement. Each register file chip can be dedicated to data movement
associated with one adjacent module. This allows the simultaneous
movement of up to four different data packets during a single transfer
interval. The data transfers occur at the same rate as the internal
computations, namely, 125 nanoseconds/transfer. The data flow network
is capable of supporting a number of topological configurations. A
characteristic of many signal processing algorithms is the need for
some sort of global or broadcast data movement. Each APM can support
broadcast in several different configurations. By moving data through
the data flow network in a transparent mode, row, column and diagonal
broadcasting is supported during a single clock cycle.
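
As a rough illustration of the broadcast modes and the transfer rate
just described, the following sketch lists which modules of a 4 X 4
array would receive a row, column, or diagonal broadcast from a given
source, and estimates the peak I/O rate of one APM. The function name
and the coordinate convention are hypothetical. "Diagonal" is taken
here to mean the diagonal running from upper left to lower right
through the source, and each data packet is assumed to be one 40-bit
bus word; neither detail is stated in the text.

```python
N = 4  # 4 x 4 array of APMs


def broadcast_targets(src, mode, n=N):
    """List the (row, col) modules reached by a broadcast from src."""
    r0, c0 = src
    if mode == "row":
        cells = [(r0, c) for c in range(n)]
    elif mode == "column":
        cells = [(r, c0) for r in range(n)]
    elif mode == "diagonal":
        # Assumed orientation: cells where row - col matches the source.
        cells = [(r, c) for r in range(n) for c in range(n)
                 if r - c == r0 - c0]
    else:
        raise ValueError(f"unknown broadcast mode: {mode}")
    return [cell for cell in cells if cell != src]


TRANSFER_NS = 125   # one transfer per 125 ns
BUS_BITS = 40       # width of each systolic data bus
PORTS = 4           # up to four packets move per transfer interval

transfers_per_s = 1_000_000_000 // TRANSFER_NS
peak_bits_per_s = transfers_per_s * BUS_BITS * PORTS

print(broadcast_targets((1, 2), "row"))
print(broadcast_targets((1, 2), "column"))
print(broadcast_targets((1, 2), "diagonal"))
print(f"{peak_bits_per_s / 1e9:.2f} Gbit/s peak per APM")
```

With all four ports active, the estimate comes to 1.28 Gbit/s per
module, the same order as the rate at which the floating point chips
can consume operands.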

The on-board control of all the functional elements of the APM
described so far originates from a micro-sequencer/instruction RAM. To
accommodate the highly parallel nature of the APM, the instruction
word contained in the RAM is 176 bits wide. The micro-sequencer
accepts pointing vectors to the starting address of desired program
segments via the control bus. The program flow can be modified by
testing the I/O handshaking, the contents of the data tag byte,
auxiliary mode registers, or data related arithmetic operations.
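
The control scheme can be pictured with a toy model: a host-supplied
start address selects a program segment, and a conditional branch
tests a status flag such as the data tag. Everything in the sketch
below is hypothetical, including the class name, the operation names,
and the flag name "tag_is_end". The real instruction word is far wider
and controls many functional units at once, so this shows only the
sequencing idea.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MicroInstruction:
    op: str                           # stands in for the wide control word
    branch_if: Optional[str] = None   # name of the condition to test
    branch_to: Optional[int] = None   # address taken when the test is true
    halt: bool = False                # marks the end of a program segment


def run_segment(program: List[MicroInstruction], start: int,
                conditions: Dict[str, bool], max_steps: int = 100):
    """Run from the start address (the 'pointing vector') until a halt."""
    pc = start
    trace = []
    for _ in range(max_steps):
        instr = program[pc]
        trace.append(instr.op)
        if instr.halt:
            break
        if instr.branch_if is not None and conditions.get(instr.branch_if):
            pc = instr.branch_to
        else:
            pc += 1
    return trace


program = [
    MicroInstruction("idle", halt=True),                              # 0
    MicroInstruction("read_port"),                                    # 1
    MicroInstruction("test_tag", branch_if="tag_is_end", branch_to=4),  # 2
    MicroInstruction("multiply_accumulate", halt=True),               # 3
    MicroInstruction("flush_result", halt=True),                      # 4
]

normal = run_segment(program, start=1, conditions={"tag_is_end": False})
ending = run_segment(program, start=1, conditions={"tag_is_end": True})
print(normal)
print(ending)
```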

The final functional block in the APM is the diagnostics interface,
which allows the system host computer to load or interrogate APM
memory, registers and internal buses. This interface is used for such
functions as down-loading instructions and data and has the additional
feature of supporting initial debugging efforts and operational
confidence testing/fault isolation.


Input/Output Module
Functional Features

Each IOM is composed of 190 integrated circuits and has been
constructed on an 8" X 16" wire wrap circuit panel. Figure 3
identifies the major functional components of each of the 4 IOM's in
the system. To minimize the complexity of programming and the hardware
debug cycle, the microcode sequencer and diagnostics interface are
identical in function to those used in the APM. The IOM contains no
data computational circuitry but is expressly designed to move data
efficiently. The data flow network connects the data present at the
boundary of the systolic array to the internal 4K buffer memory. The
IOM handshaking and transfer rates are identical to those of the APM's
to which it is connected. Each of the boundary IOM's has 2
non-systolic ports which serve important interface functions in the
application of the array hardware. The external port comes complete
with a separate set of handshaking signals which allow the intelligent
communication of data with external hardware without interfering with
the systolic movement of data within the array itself. The data bus
(data circus) allows the IOM processors to act as a distributed
interface system. The registration of data input and output between
the array hardware and the external system hardware can be programmed
into the IOM program.

Figure 3 IOM Functional Diagram


The SCM is composed of 140 integrated circuits and is similar in
construction to the IOM's. Figure 4 identifies its functional
components and its relationship to the other system components. The
main function of the SCM is to convert an extension of the host
computer bus (16 bits) to the format used in the ACB (65 bits). The
host computer can address each of the diagnostic and control registers
contained in each or groups of APM's and/or IOM's. System status,
including global busy/ready, arithmetic error detection, or system
instruction memory parity fault, can be monitored by the host computer
via the SCM. The incremental algorithm commands (the selection of the
desired microcode program modules) can be directed at the hardware
modules of the array. The system clock originates on the SCM board and
a separate copy of the clock is sent to each system module. This clock
is programmable in speed, and can be incrementally controlled and used
during hardware debugging and algorithm mapping. The SCM incorporates
the circuitry needed to allow data movement between the host computer
and any one of the 4 external ports. This feature is included to aid
in the initial mapping of the algorithm in the absence of the balance
of the external system hardware.

Figure 4 SCM Functional Diagram


The system hardware, with the exception of the host computer, is
housed in a 24" wide, 30" deep and 56" high equipment rack. A custom
cardcage complete with fans was constructed which allows the mounting
of the 16 APM cards in the front side of the backplane circuit card
and the IOM's and SCM in the back. Due to the number of wide parallel
buses and the high clock speed of the array, all the systolic
connectivity is contained in one 19" X 24" 10-layer circuit card. A
special power distribution grid constructed from copper bar stock was
attached to the backplane to accommodate the current load of the
present system configuration (6500 ICs) and future enhancements up to
400 amps.
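
As a quick cross-check of the quoted chip count, the per-module figures
given in the text can be totalled. The sketch below uses only numbers
quoted in this paper. The average current per chip is purely
illustrative, since it assumes the full 400 amp grid capacity is spread
evenly over the quoted 6500 ICs.

```python
# Per-module IC counts quoted in the text.
APM_COUNT, ICS_PER_APM = 16, 330
IOM_COUNT, ICS_PER_IOM = 4, 190
SCM_COUNT, ICS_PER_SCM = 1, 140

QUOTED_TOTAL_ICS = 6500   # system figure quoted in the text
GRID_CAPACITY_AMPS = 400  # stated capacity of the distribution grid

itemized_ics = (APM_COUNT * ICS_PER_APM
                + IOM_COUNT * ICS_PER_IOM
                + SCM_COUNT * ICS_PER_SCM)
not_itemized = QUOTED_TOTAL_ICS - itemized_ics

# Illustrative only: assumes the whole grid capacity feeds the ICs evenly.
avg_ma_per_ic = GRID_CAPACITY_AMPS / QUOTED_TOTAL_ICS * 1000

print(f"itemized ICs: {itemized_ics}")
print(f"difference from quoted total: {not_itemized}")
print(f"average budget: {avg_ma_per_ic:.1f} mA per IC")
```

The itemized modules account for 6180 chips, so the quoted total of
6500 reads as a rounded figure or includes circuitry not broken out by
module.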

A secondary Multibus cardcage has been included in the equipment
cabinet to accommodate data acquisition system components and the
possible future use of a single board computer to replace the present
IBM-AT host.